Report Assessment Ch2 Technology Visions

Data Management Report

Author
Affiliation

Rainer M. Krug

Doi
Abstract

A short description of what this is about. This is not a traditional abstract, but rather something else …

Working Title

IPBES_TCA_Ch2_technology

Code repo

Github - private

Build No: 78

%The BuildNo is automatically increased by one each time the report is rendered. It is used to distinguish different renderings when the version stays the same%.

Introduction

All searches are done on all works in OpenAlex. A search within the TCA Corpus is not possible at the moment, but we are working on it.

The following steps are carried out and documented in this report:

Step 1: Determination of numbers

The search terms are based on the shared Google Doc. They are cleaned up for use in OpenAlex.
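The definitions of vision_st and technology_st are not shown in this report. As a minimal sketch, assuming the cleaned terms are kept one per line (as suggested by input/Ch2_technology/vision.txt, which is read in the Results section) and combined by OR, a search string could be assembled as follows. The helper build_search_term() and the example terms are hypothetical and only illustrate the idea.

```r
# Hypothetical helper (not part of the report): combine individual terms
# into one OR-connected search string.
build_search_term <- function(terms) {
    terms <- trimws(terms)          # remove stray whitespace
    terms <- terms[nzchar(terms)]   # drop empty lines
    terms <- unique(terms)          # drop duplicated terms
    paste(terms, collapse = " OR ")
}

# Example terms as they might come from readLines() on a term file
vision_terms <- c("vision", "visioning ", "", "\"deliberate process\"")
vision_st_example <- build_search_term(vision_terms)
print(vision_st_example)
```

Phrases consisting of several words are quoted in the term list so that they are searched as one phrase.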

Vision

The search term is vision (OpenAlex search).

#|

vision_count <- openalexR::oa_fetch(
    title_and_abstract.search = vision_st,
    count_only = TRUE,
    output = "list",
    verbose = TRUE
)$count

Technology

The search term is technology (OpenAlex search).

#|

technology_count <- openalexR::oa_fetch(
    title_and_abstract.search = compact(technology_st),
    count_only = TRUE,
    output = "list",
    verbose = TRUE
)$count

Vision AND technology

OpenAlex search.

The search term is vision AND technology.

Count

#|

vision_technology_count <-
    openalexR::oa_fetch(
        title_and_abstract.search = compact(paste0("(", vision_st, ") AND (", technology_st, ")")),
        output = "list",
        count_only = TRUE,
        verbose = TRUE
    )$count
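The combined query wraps each search string in parentheses before joining them with AND, so that the OR-connected terms inside each string stay grouped. A minimal sketch with made-up two-term strings (the real vision_st and technology_st are much longer, and the toy values below are assumptions for illustration only):

```r
# Toy stand-ins for the real search strings (assumed values)
vision_st_toy <- "vision OR foresight"
technology_st_toy <- "technology OR robotics"

# Same construction as in the report: parenthesise both sides, join with AND
combined <- paste0("(", vision_st_toy, ") AND (", technology_st_toy, ")")
print(combined)

# For comparison: without parentheses the grouping of the OR terms is ambiguous
unparenthesised <- paste(vision_st_toy, "AND", technology_st_toy)
print(unparenthesised)
```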

Count Subfields

#|

vision_technology_subfields <- openalexR::oa_query(
    title_and_abstract.search = compact(paste0("(", vision_st, ") AND (", technology_st, ")")),
    group_by = "primary_topic.subfield.id",
    verbose = TRUE
) |>
    openalexR::oa_request() |>
    dplyr::bind_rows() |>
    dplyr::arrange(key)

## clean up missing or wrong vision_technology_subfields$key_display_name
## (as.numeric() emits "Warning: NAs introduced by coercion" for the
## non-numeric names; this is expected)
need_cleaning <- is.na(vision_technology_subfields$key_display_name) |
    !is.na(as.numeric(vision_technology_subfields$key_display_name))
fine <- !need_cleaning

vision_technology_subfields <- vision_technology_subfields |>
    dplyr::filter(fine) |>
    dplyr::select(key, key_display_name) |>
    dplyr::distinct() |>
    merge(y = vision_technology_subfields[need_cleaning, -2], by = "key") |>
    dplyr::bind_rows(vision_technology_subfields[fine, ]) |>
    dplyr::group_by(key, key_display_name) |>
    dplyr::summarize(count = sum(count))
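The clean-up above treats a subfield name as broken when it is missing or purely numeric, takes the valid name for the same key from the intact rows, and then sums the counts per subfield. The following base R sketch reproduces that logic on a made-up table; the keys, names and counts are invented for illustration and are not results of the report.

```r
# Made-up input: two subfields, each with one intact and one broken name
subfields <- data.frame(
    key = c("1105", "1105", "2303", "2303"),
    key_display_name = c("Ecology", "1105", "Oceanography", NA),
    count = c(10, 5, 7, 3),
    stringsAsFactors = FALSE
)

# A name needs cleaning when it is missing or purely numeric
need_cleaning <- is.na(subfields$key_display_name) |
    !is.na(suppressWarnings(as.numeric(subfields$key_display_name)))

# Lookup table of valid names per key, built from the intact rows
lookup <- unique(subfields[!need_cleaning, c("key", "key_display_name")])

# Replace every name via the lookup, then sum the counts per subfield
subfields$key_display_name <- lookup$key_display_name[match(subfields$key, lookup$key)]
cleaned <- aggregate(count ~ key + key_display_name, data = subfields, FUN = sum)
print(cleaned)
```

As in the pipeline above, a key that has no intact row at all cannot be repaired and drops out of the result.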

Download technology AND vision Corpus

The downloaded pages will be stored in data/Ch2_technology/pages and the Arrow database in data/Ch2_technology/corpus.

These files are not on GitHub!

The corpus can be read by running get_corpus(), which opens the database so that it can then be fed into a dplyr pipeline. After most dplyr functions, the actual data needs to be collected via collect().

Only then is the actual data read!

The download needs to be enabled by setting eval: true in the code block below.

Download TCA Corpus

#|

tic()
pages_dir <- file.path(".", "data", "Ch2_technology", "pages")

dir.create(
    path = pages_dir,
    showWarnings = FALSE,
    recursive = TRUE
)

years <- oa_fetch(
    title_and_abstract.search = compact(paste0("(", vision_st, ") AND (", technology_st, ")")),
    group_by = "publication_year",
    paging = "cursor",
    verbose = FALSE
)$key

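#######
## Skip years which were already downloaded completely. Years with a
## leftover next_page.rds file were interrupted and are processed again.
#######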
processed <- list.dirs(
    path = pages_dir,
    full.names = FALSE,
    recursive = FALSE
) |>
    gsub(
        pattern = paste0("^pages_publication_year=", ""),
        replacement = ""
    )

interrupted <- list.files(
    path = pages_dir,
    pattern = "^next_page.rds",
    full.names = TRUE,
    recursive = TRUE
) |>
    gsub(
        pattern = paste0("^", pages_dir, "/pages_publication_year=", ""),
        replacement = ""
    ) |>
    gsub(
        pattern = "/next_page.rds$",
        replacement = ""
    )

completed <- processed[!(processed %in% interrupted)]

years <- years[!(years %in% completed)]
#######
#######

pbmcapply::pbmclapply(
    sample(years),
    function(y) {
        message("\nGetting data for year ", y, " ...")
        output_path <- file.path(pages_dir, paste0("pages_publication_year=", y))

        dir.create(
            path = output_path,
            showWarnings = FALSE,
            recursive = TRUE
        )

        data <- oa_query(
            title_and_abstract.search = compact(paste0("(", vision_st, ") AND (", technology_st, ")")),
            publication_year = y,
            options = list(
                select = c("id", "doi", "authorships", "publication_year", "display_name", "abstract_inverted_index", "topics")
            ),
            verbose = FALSE
        ) |>
            IPBES.R::oa_request_IPBES(
                count_only = FALSE,
                output_path = output_path,
                verbose = TRUE
            )
    },
    mc.cores = 1,
    mc.preschedule = FALSE
)

toc()
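The bookkeeping between the two ####### markers decides which years still have to be downloaded. Read from the code, a year directory that still contains a next_page.rds file is treated as interrupted, and only directories without such a file count as completed. The sketch below replays this on a temporary directory with made-up years; it uses only base R and does not contact OpenAlex.

```r
# Build a toy pages directory: 2020 is complete, 2021 was interrupted
pages_dir <- file.path(tempdir(), "pages_demo")
unlink(pages_dir, recursive = TRUE)
dir.create(file.path(pages_dir, "pages_publication_year=2020"), recursive = TRUE)
dir.create(file.path(pages_dir, "pages_publication_year=2021"), recursive = TRUE)
saveRDS("cursor", file.path(pages_dir, "pages_publication_year=2021", "next_page.rds"))

# Years reported by the search (made-up values)
years <- c("2019", "2020", "2021")

# Years for which a directory exists
processed <- sub(
    "^pages_publication_year=", "",
    list.dirs(pages_dir, full.names = FALSE, recursive = FALSE)
)

# Years whose directory still holds a next_page.rds file
interrupted <- list.files(
    pages_dir,
    pattern = "^next_page\\.rds$", full.names = TRUE, recursive = TRUE
) |>
    dirname() |>
    basename() |>
    sub(pattern = "^pages_publication_year=", replacement = "")

completed <- setdiff(processed, interrupted)
remaining <- setdiff(years, completed)
print(remaining)
```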

Convert TCA Corpus to Arrow

The fields author and topics are serialized in the arrow database and need to be unserialized by using unserialize_arrow() on a dataset containing the two columns.
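How serialize_arrow() and unserialize_arrow() are implemented is not shown in this report. As a rough illustration of the general idea only, nested R objects can be turned into raw vectors with base R's serialize() and restored with unserialize(); whether the two helper functions work exactly this way is an assumption. The nested topics below are made up.

```r
# Made-up nested field: one data frame of topics per work
topics <- list(
    data.frame(name = c("Ecology", "Forestry"), stringsAsFactors = FALSE),
    data.frame(name = "Robotics", stringsAsFactors = FALSE)
)

# Serialize each nested element into a raw vector (storable as binary data)
topics_raw <- lapply(topics, serialize, connection = NULL)

# Restore the original objects from the raw vectors
topics_restored <- lapply(topics_raw, unserialize)

print(identical(topics_restored, topics))
```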

tic()

pages_dir <- file.path(".", "data", "Ch2_technology", "pages")
arrow_dir <- file.path(".", "data", "Ch2_technology", "corpus")

years <- list.dirs(
    path = pages_dir,
    full.names = TRUE,
    recursive = FALSE
)

years_done <- list.dirs(
    path = arrow_dir,
    full.names = TRUE,
    recursive = FALSE
)

years <- years[
    !(
        gsub(
            x = years,
            pattern = paste0("^", pages_dir, "/pages_publication_year="),
            replacement = ""
        ) %in% gsub(
            x = years_done,
            pattern = paste0("^", arrow_dir, "/publication_year="),
            replacement = ""
        )
    )
]

pbapply::pblapply(
    years,
    function(year) {
        message("\n     Processing year ", year, " ...\n")
        pages <- list.files(
            path = year,
            pattern = "^page_",
            full.names = TRUE,
            recursive = TRUE
        )
        invisible(
            pbmcapply::pbmclapply(
                pages,
                function(page) {
                    data <- readRDS(file.path(page))$results |>
                        openalexR::works2df(verbose = FALSE)
                    data$author_abbr <- IPBES.R::abbreviate_authors(data)
                    data <- serialize_arrow(data)

                    data$page <- page |>
                        basename() |>
                        gsub(pattern = "^page_", replacement = "") |>
                        gsub(pattern = ".rds$", replacement = "")

                    arrow::write_dataset(
                        data,
                        path = arrow_dir,
                        partitioning = c("publication_year", "page"),
                        format = "parquet",
                        existing_data_behavior = "overwrite"
                    )
                },
                mc.cores = 6 # params$mc.cores
            )
        )
    }
)
toc()
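Once the corpus has been written as a partitioned Parquet dataset, it can be queried lazily: the data is only read when collect() is called. The sketch below shows this with arrow::open_dataset() on a tiny made-up dataset in a temporary directory. That read_corpus() behaves like open_dataset() is an assumption, and the example is skipped when the arrow or dplyr packages are not installed.

```r
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
    demo_dir <- file.path(tempdir(), "corpus_demo")
    unlink(demo_dir, recursive = TRUE)

    # Made-up works, written with the same kind of partitioning as above
    toy <- data.frame(
        id = c("W1", "W2", "W3"),
        publication_year = c(2019L, 2020L, 2020L)
    )
    arrow::write_dataset(
        toy,
        path = demo_dir,
        partitioning = "publication_year",
        format = "parquet"
    )

    # Opening the dataset does not load any rows yet
    corpus <- arrow::open_dataset(demo_dir)

    works_2020 <- corpus |>
        dplyr::filter(publication_year == 2020) |>
        dplyr::select(id) |>
        dplyr::collect() # the data is read here

    print(nrow(works_2020))
}
```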

Filter Corpus with TCA Corpus

tic() # start the timer which is stopped by toc() at the end of this block

ids_technology <- read_corpus(file.path("data", "Ch2_technology", "corpus")) |>
    dplyr::select(id) |>
    collect() |>
    unlist()

ids_tca <- read_corpus(file.path("..", "IPBES_TCA_Corpus", "data", "tca_corpus", "Ch2_technology", "corpus")) |>
    dplyr::select(id) |>
    collect() |>
    unlist()

ids_subs_tca <- ids_technology[ids_technology %in% ids_tca]

arrow_tca_dir <- file.path(".", "data", "Ch2_technology", "corpus_tca")
arrow_dir <- file.path(".", "data", "Ch2_technology", "corpus")


year_dirs <- list.dirs(
    path = arrow_dir,
    full.names = TRUE,
    recursive = FALSE
)

year_done <- list.dirs(
    path = arrow_tca_dir,
    full.names = TRUE,
    recursive = FALSE
)

year_dirs <- year_dirs[!(basename(year_dirs) %in% basename(year_done))]

years <- basename(year_dirs) |>
    gsub(
        pattern = "publication_year=",
        replacement = ""
    )
ys <- seq_len(length(year_dirs))


pbapply::pblapply(
    ys,
    function(y) {
        data <- read_corpus(year_dirs[[y]]) |>
            dplyr::collect() |>
            dplyr::filter(id %in% ids_subs_tca)
        if (nrow(data) > 0) {
            data |>
                dplyr::mutate(publication_year = as.integer(years[[y]])) |>
                arrow::write_dataset(
                    path = arrow_tca_dir,
                    partitioning = "publication_year",
                    format = "parquet",
                    existing_data_behavior = "overwrite"
                )
        }
    }
)

toc()
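The filter keeps only those works of the vision AND technology corpus whose id also occurs in the TCA Corpus, and then applies that id list to each year separately. A minimal base R sketch with made-up ids:

```r
# Made-up OpenAlex-style ids
ids_technology <- c("W1", "W2", "W3", "W4")
ids_tca <- c("W2", "W4", "W9")

# Ids present in both corpora
ids_subs_tca <- ids_technology[ids_technology %in% ids_tca]

# One (made-up) year of the technology corpus, filtered with these ids
corpus_year <- data.frame(
    id = c("W1", "W2", "W3"),
    publication_year = 2020L,
    stringsAsFactors = FALSE
)
filtered <- corpus_year[corpus_year$id %in% ids_subs_tca, ]
print(filtered)
```

Years in which no work matches yield an empty result, which is why the loop above only writes a year when nrow(data) > 0.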

Extract 50 random papers from technology AND vision in TCA Corpus

#|

fn <- file.path("data", "Ch2_technology", "random_50_technology_in_tca.xlsx")

if (!file.exists(fn)) {
    set.seed(14)
    read_corpus(file.path("data", "Ch2_technology", "corpus_tca")) |>
        dplyr::select(id, author_abbr, display_name, ab) |>
        dplyr::rename(abstract = ab, title = display_name) |>
        dplyr::collect() |>
        dplyr::slice_sample(n = 50) |>
        dplyr::mutate(
            abstract = substr(abstract, start = 1, stop = 5000)
        ) |>
        writexl::write_xlsx(path = fn)
}
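Setting the seed before sampling makes the draw reproducible, and the abstracts are cut to 5,000 characters before export. The sketch below shows both steps in base R on a made-up corpus; draw_sample() is a hypothetical helper and not part of the report.

```r
# Made-up corpus with 200 works and overly long abstracts
toy_corpus <- data.frame(
    id = paste0("W", 1:200),
    abstract = strrep("x", 6000),
    stringsAsFactors = FALSE
)

# Hypothetical helper: reproducible sample with truncated abstracts
draw_sample <- function(corpus, n = 50, seed = 14) {
    set.seed(seed)
    picked <- corpus[sample(nrow(corpus), n), ]
    picked$abstract <- substr(picked$abstract, start = 1, stop = 5000)
    picked
}

first <- draw_sample(toy_corpus)
second <- draw_sample(toy_corpus)

# The same seed gives the same 50 works
print(identical(first$id, second$id))
```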

Results

vision

Hits for search term vision: 57,750,272 hits in OpenAlex.

Individual terms combined by OR:

#|

assess_search_term(
    readLines(file.path("input", "Ch2_technology", "vision.txt")),
    excl_others = TRUE
) |>
   dplyr::arrange(desc(count)) |>
    dplyr::mutate(count = formatC(count, format = "f", big.mark = ",", digits = 0)) |>
    knitr::kable()

term                      count
value                 6,687,930
image                 3,722,015
objective             3,430,442
view                  3,098,422
strategy              2,970,560
target                2,719,290
future                2,258,694
project               2,227,576
policy                2,166,326
transition            1,914,968
plan                  1,807,613
perspective           1,707,194
transmission          1,371,865
goal                    877,433
perception              863,425
platform                803,328
movement                775,625
scenarios               553,005
desire                  435,790
visualization           369,132
discourse               314,099
mission                 302,164
hope                    267,719
initiative              261,665
intention               194,792
fiction                 169,695
wish                    162,978
agenda                  159,312
aspiration              150,849
creativity              112,006
imagery                 109,290
co-production           108,662
cosmology               105,147
dream                   104,626
sight                    82,637
imagination              70,430
inspiration              66,579
visualisation            42,822
fantasy                  31,793
roadmap                  23,962
programm                 23,517
archetype                23,501
worldview                19,881
visionary                 9,114
foresight                 8,647
cosmovision               1,809
“deliberate process”        170
cosmocentric                 28
vision                        0
visioning                     0
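The count column of the table is sorted in descending order and then formatted with a thousands separator. A small base R sketch of these two steps, using three rows taken from the table above:

```r
# Three rows from the table above, in arbitrary order
counts <- data.frame(
    term = c("roadmap", "value", "goal"),
    count = c(23962, 6687930, 877433),
    stringsAsFactors = FALSE
)

# Sort by count (largest first), then format with a thousands separator
counts <- counts[order(counts$count, decreasing = TRUE), ]
counts$count <- formatC(counts$count, format = "f", big.mark = ",", digits = 0)
print(counts, row.names = FALSE)
```

The sorting has to happen before the formatting, because formatC() returns character strings, which would be ordered alphabetically rather than numerically.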

technology

Hits for search term technology: 13,936,840 hits in OpenAlex.

Individual terms combined by OR:

#|

assess_search_term(
    st = readLines(file.path("input", "Ch2_technology", "technology.txt")),
    excl_others = TRUE
) |>
    dplyr::arrange(desc(count)) |>
    dplyr::mutate(count = formatC(count, format = "f", big.mark = ",", digits = 0)) |>
    knitr::kable()

term                                     count
Technology                           5,111,798
Software                             1,932,595
“Social Media”                         961,039
Internet                               734,765
Virtualization                         688,302
Robotics                               653,056
“Machine Learning”                     475,519
“Deep Learning”                        278,527
“Renewable Energy”                     176,036
Biotechnology                          170,242
“Artificial Intelligence”              169,188
“API”                                  120,387
“Big Data”                             111,756
Nanotechnology                          76,265
“Computer Vision”                       73,706
“E-commerce”                            73,352
5G                                      65,420
“Cloud Computing”                       60,292
“Speech Recognition”                    51,331
Blockchain                              50,255
“Natural Language Processing”           42,802
“3D Printing”                           40,905
“Augmented Reality”                     37,556
“Smart Grid”                            35,071
“Circular Economy”                      30,415
“Autonomous Vehicle”                    26,926
“Digital Transformation”                22,781
Cybersecurity                           20,926
“Clean Energy”                          19,338
“Quantum Computing”                     18,756
“Data Science”                          16,073
“Cyber-Physical Systems”                15,496
“Edge Computing”                        14,749
“Smart Home”                            13,954
“Digital Twin”                          13,673
Cryptocurrency                          12,123
Fintech                                  9,410
“Machine-to-Machine”                     7,680
“Mixed Reality”                          5,706
“Facial Recognition”                     5,315
Microservices                            4,418
“Digital Currency”                       2,719
“Application Programming Interface”      2,092
“Agile Development”                      2,036
DevOps                                   1,830
Virtualisation                             963
“Digital Wallet”                           466
“Digital Ethics”                           244
“Virtual Reality”                            0
“Internet of Things”                         0
“Sustainable Technology”                     0
“Wearable Technology”                        0
“Genetic Engineering”                        0
“Internet Privacy”                           0
“Internet Safety”                            0
“Genetic engineering”                        0
“Space Technology”                           0
“Computational Technology”                   0

vision AND technology in TCA Corpus

The vision AND technology corpus filtered with the TCA Corpus contains the following number of works:

read_corpus(file.path("data", "Ch2_technology", "corpus_tca")) |>
    dplyr::select(id) |>
    dplyr::collect() |>
    nrow()
Warning: Invalid metadata$r
[1] 643474

Random Sample of 50 works

The file contains the id, author_abbr, title and abstract of the papers. Two samples were generated:

  • works in the technology corpus AND in the TCA corpus, which can be downloaded here.

Subfields

The subfields are based on the main topic assigned to each work. Other topics are also assigned, but this one has been identified as the main topic by an algorithm. count is the number of works in the vision AND technology corpus which have been assigned to the subfield.

Please take a look at these subfields of the topics to identify the ones to be filtered out.

The easiest way is to download the Excel file through the button and mark the subfields to be filtered out.

IPBES.R::table_dt(vision_technology_subfields, fixedColumns = NULL, fn = "Vision Technology Subfields")

Reuse

Citation

BibTeX citation:
@report{krug,
  author = {Krug, Rainer M.},
  title = {Report {Assessment} {Ch2} {Technology} {Visions}},
  doi = {XXXXXX},
  langid = {en},
  abstract = {A short description of what this is about. This is not a
    traditional abstract, but rather something else ...}
}
For attribution, please cite this work as:
Krug, Rainer M. n.d. “Report Assessment Ch2 Technology Visions.” IPBES Data Management Report. https://doi.org/XXXXXX.